knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(lubridate)
library(gridExtra)
library(ggpubr)
library(kableExtra)
library(emmeans)
library(Hmisc)
library(ggcorrplot)
library(grid)
library(broom)
library(reshape2)
library(car)
options(width=100)

This is to certify that the work I am submitting is my own. All external references and sources are clearly acknowledged and identified within the contents. I am aware of the University of Warwick regulation concerning plagiarism and collusion.

No substantial part(s) of the work submitted here has also been submitted by me in other assessments for accredited courses of study, and I acknowledge that if this has been done an appropriate reduction in the mark I might otherwise have received will be made.

AI was used in the preparation of this work. It was used in the development of the code: it provided example uses of functions and approaches to elements of the challenges, which I then interpreted and modified to apply to this data and report.


QUESTION 1: Cardiovascular Disease in England

Section 1

Introduction to Analysis

This report analyzes factors influencing the prevalence of Cardiovascular Disease (CVD) across various local authority regions in England. Using data from the UK Office for National Statistics, we examine how overweight prevalence, smoking rates, wellbeing scores, and poverty levels affect CVD rates. The goal is to identify significant predictors and visualize the impact of poverty on CVD prevalence.

The analysis includes:

  1. Data Integrity Checks: Ensuring the dataset’s reliability by identifying and addressing missing or inconsistent data.
  2. Visualizations: Creating graphs to illustrate the distribution of CVD rates and explore relationships with influencing factors.
  3. Multiple Linear Regression: Assessing the impact of overweight, smoking, wellbeing, and poverty on CVD prevalence.
  4. Interpretation of Results: Translating statistical findings into clear recommendations for public health strategies.

The report is designed to be reproducible, with well-documented and organized code, enabling future analysts to adapt the analysis to new or updated datasets seamlessly.


Data Dictionary

The data dictionary provides an overview of the variables included in the dataset used for the analysis. It outlines each variable’s name, data type, and a brief description of its purpose or content.

# Create the data dictionary as a data frame
data_dictionary <- data.frame(
  Column_Name = c("area_name", "area_code", "Population", "Poverty", 
                  "CVD", "overweight", "smokers", "wellbeing"),
  Data_Type = c("character", "character", "numeric", "numeric", 
                "numeric", "numeric", "numeric", "numeric"),
  Description = c(
    "Name of the local authority area.",
    "Unique code for the local authority.",
    "Total population of the area.",
    "Percentage of population in poverty.",
    "Percentage with cardiovascular disease.",
    "Proportion of overweight individuals.",
    "Proportion of smokers in the area.",
    "Average wellbeing score."
  )
)

# Render the data dictionary as a table
kable(data_dictionary, format = "html", caption = "Data Dictionary") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Data Dictionary
Column_Name Data_Type Description
area_name character Name of the local authority area.
area_code character Unique code for the local authority.
Population numeric Total population of the area.
Poverty numeric Percentage of population in poverty.
CVD numeric Percentage with cardiovascular disease.
overweight numeric Proportion of overweight individuals.
smokers numeric Proportion of smokers in the area.
wellbeing numeric Average wellbeing score.

Data Loading and Integrity Check

# Load the dataset
cvd_data <- read_csv("Cardio_Vascular_Disease.csv")

# Display the first few rows
head(cvd_data)
## # A tibble: 6 × 8
##   area_name            area_code Population Poverty   CVD overweight smokers wellbeing
##   <chr>                <chr>          <dbl>   <dbl> <dbl>      <dbl>   <dbl>     <dbl>
## 1 Hartlepool           E06000001      88905    23    13.7       34.6    17.3      7.33
## 2 Middlesbrough        E06000002     133375    25.1  13.1       32.0    17.9      7.21
## 3 Redcar and Cleveland E06000003     131035    21.4  15         33.4    13.3      7.44
## 4 Stockton-on-Tees     E06000004     186365    19.5  12.4       40.2    12.5      7.4 
## 5 Darlington           E06000005     104300    19.7  11.9       35.7    10.6      7.25
## 6 Halton               E06000006     125725    21.9  15.7       32.3    13.2      7.29
# Check the structure of the dataset
str(cvd_data)
## spc_tbl_ [385 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ area_name : chr [1:385] "Hartlepool" "Middlesbrough" "Redcar and Cleveland" "Stockton-on-Tees" ...
##  $ area_code : chr [1:385] "E06000001" "E06000002" "E06000003" "E06000004" ...
##  $ Population: num [1:385] 88905 133375 131035 186365 104300 ...
##  $ Poverty   : num [1:385] 23 25.1 21.4 19.5 19.7 21.9 17.9 24.9 26.1 22.4 ...
##  $ CVD       : num [1:385] 13.7 13.1 15 12.4 11.9 15.7 12 13.4 17.6 11.2 ...
##  $ overweight: num [1:385] 34.6 32 33.4 40.2 35.7 ...
##  $ smokers   : num [1:385] 17.3 17.9 13.3 12.5 10.6 13.2 10 15.5 20.6 22 ...
##  $ wellbeing : num [1:385] 7.33 7.21 7.44 7.4 7.25 7.29 7.21 7.43 7.16 7.14 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   area_name = col_character(),
##   ..   area_code = col_character(),
##   ..   Population = col_double(),
##   ..   Poverty = col_double(),
##   ..   CVD = col_double(),
##   ..   overweight = col_double(),
##   ..   smokers = col_double(),
##   ..   wellbeing = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Strengths in the Structure
# Variable Types: The data types (chr for text and num for numeric variables) seem appropriate for each column.                                                 
# Dimensions: The dataset has 385 rows and 8 columns, which matches the expected structure based on the description.

# Inspect the data
summary(cvd_data)
##   area_name          area_code           Population         Poverty           CVD       
##  Length:385         Length:385         Min.   :   1960   Min.   :12.90   Min.   : 7.90  
##  Class :character   Class :character   1st Qu.: 100640   1st Qu.:16.90   1st Qu.:10.90  
##  Mode  :character   Mode  :character   Median : 135275   Median :18.70   Median :12.30  
##                                        Mean   : 174804   Mean   :19.34   Mean   :12.42  
##                                        3rd Qu.: 216600   3rd Qu.:21.40   3rd Qu.:14.10  
##                                        Max.   :1056970   Max.   :30.70   Max.   :17.80  
##                                        NA's   :76        NA's   :76      NA's   :76     
##    overweight       smokers        wellbeing    
##  Min.   :10.24   Min.   : 3.20   Min.   :6.610  
##  1st Qu.:21.63   1st Qu.:10.72   1st Qu.:7.280  
##  Median :25.52   Median :12.80   Median :7.410  
##  Mean   :25.52   Mean   :13.07   Mean   :7.423  
##  3rd Qu.:29.46   3rd Qu.:15.30   3rd Qu.:7.560  
##  Max.   :40.22   Max.   :27.80   Max.   :8.170  
##  NA's   :72      NA's   :7       NA's   :15
# When comparing means and medians, the Population variable is notably right-skewed: the mean (174,804) exceeds the median (135,275), indicating a few regions with very large populations. In contrast, Poverty, CVD prevalence, and Overweight are roughly symmetric, suggesting balanced socio-economic and health conditions across regions. Smokers show a slight right skew, while Wellbeing scores are tightly clustered and roughly symmetric, reflecting consistent wellbeing levels throughout the dataset. Visualisations performed after data cleaning can further verify these observations.
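These distributional claims can be checked numerically. A minimal sketch, using a simple moment-based skewness helper written here for illustration (not part of the original pipeline), applied to the loaded `cvd_data`:

```r
# Hypothetical helper: moment-based sample skewness (base R only).
# Positive values indicate right skew; values near zero indicate symmetry.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2)^(3 / 2))
}

sapply(
  cvd_data[, c("Population", "Poverty", "CVD", "overweight", "smokers", "wellbeing")],
  skewness
)
```

Population should show a clearly positive value, with the other variables much closer to zero.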

# The dataset contains missing values (NA) in several critical variables; let us look for any underlying patterns in these.

# Create 'Region' variable based on 'area_code' prefix
cvd_data <- cvd_data %>%
  mutate(
    Region = case_when(
      str_starts(area_code, "E") ~ "England",
      str_starts(area_code, "S") ~ "Scotland",
      str_starts(area_code, "W") ~ "Wales",
      str_starts(area_code, "N") ~ "Northern Ireland",
      TRUE ~ "Other"
    )
  )

# Calculate missing counts and total counts, then compute percentage
missing_by_region <- cvd_data %>%
  group_by(Region) %>%
  summarise(
    Missing_Count = sum(is.na(Population) | is.na(Poverty) | is.na(CVD) | is.na(overweight)),
    Total_Count = n(),
    Missing_Percentage = (Missing_Count / Total_Count) * 100
  ) %>%
# Replace NA percentages with 0 for regions without missing data
  mutate(Missing_Percentage = replace_na(Missing_Percentage, 0))

# Display the summary table with percentages
kable(
  missing_by_region,
  col.names = c("Region", "Missing Rows", "Total Rows", "Percentage Missing"),
  caption = "Percentage of Missing Rows by Region"
)
Percentage of Missing Rows by Region
Region Missing Rows Total Rows Percentage Missing
England 11 320 3.4375
Northern Ireland 11 11 100.0000
Scotland 32 32 100.0000
Wales 22 22 100.0000
# Since Northern Ireland, Scotland, and Wales are missing all of their data in the essential variables, we exclude these regions during data cleaning and focus the analysis solely on England. Within England, only 3.4% of rows have missing values, so we can simply drop those rows.

# Check for duplicate rows
duplicate_rows <- cvd_data[duplicated(cvd_data), ]
print(duplicate_rows)
## # A tibble: 0 × 9
## # ℹ 9 variables: area_name <chr>, area_code <chr>, Population <dbl>, Poverty <dbl>, CVD <dbl>,
## #   overweight <dbl>, smokers <dbl>, wellbeing <dbl>, Region <chr>
# No duplicates were found in the dataset. The data is clean and unique across all records.

# Check data types of each variable
data_types <- cvd_data %>%
  summarise_all(class) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Data_Type")

kable(
  data_types,
  caption = "Data Types of Each Variable"
)
Data Types of Each Variable
Variable Data_Type
area_name character
area_code character
Population numeric
Poverty numeric
CVD numeric
overweight numeric
smokers numeric
wellbeing numeric
Region character
# The data types for each variable are correctly assigned based on the expected content: categorical variables as 'character' and numerical data as 'numeric'.

# Checking to see that all areas have only 1 record

# Check for duplicates in both 'area_code' and 'area_name' columns
duplicates <- cvd_data %>%
  filter(duplicated(area_code) | duplicated(area_name))

# Display the duplicate rows, if any
if (nrow(duplicates) > 0) {
  kable(duplicates, caption = "Duplicate Area Codes or Area Names")
} else {
  print("No duplicates found in 'area_code' or 'area_name' columns.")
}
## [1] "No duplicates found in 'area_code' or 'area_name' columns."
# Since there are no duplicates, we can proceed.

# Boxplots for numerical variables to identify outliers
p.boxplots <-
cvd_data %>%
  select(Population, Poverty, CVD, overweight, smokers, wellbeing) %>%
  pivot_longer(
    cols = everything(),
    names_to = "Variable",
    values_to = "Value"
  ) %>%
  ggplot(aes(x = "", y = Value, fill = Variable)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16, width = 0.5) +
  theme_minimal(base_size = 12) +
  labs(
    title = "Distribution of Key Variables: Boxplots",
    subtitle = "Highlighting the spread and outliers for each variable",
    x = NULL,
    y = "Value"
  ) +
  facet_wrap(
    ~ Variable,
    scales = "free",
    strip.position = "bottom"
  ) +
  theme(
    # Remove unnecessary x-axis elements
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    # Style facet labels
    strip.background = element_rect(fill = "grey90", color = "grey80"),
    strip.text = element_text(face = "bold", color = "black"),
    # Add padding and alignments for clarity
    plot.title = element_text(face = "bold", hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5, margin = margin(b = 10)),
    panel.grid.major.y = element_line(color = "grey85", linetype = "dashed")
  ) +
  scale_fill_brewer(palette = "Set3", guide = "none")

p.boxplots

# As observed earlier, outliers are present and are highlighted in red for the variables: Population, Poverty, Smokers, and Wellbeing. Further investigation is needed to verify the validity of these outliers so they can be dealt with appropriately.

# Checking population outliers

# Calculate IQR with NA removal
Q1 <- quantile(cvd_data$Population, 0.25, na.rm = TRUE)
Q3 <- quantile(cvd_data$Population, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1

# Define outlier thresholds
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

# Extract outlier records
population_outliers <- cvd_data %>%
  filter(Population < lower_bound | Population > upper_bound)

# View outlier records
print(population_outliers)
## # A tibble: 13 × 9
##    area_name             area_code Population Poverty   CVD overweight smokers wellbeing Region 
##    <chr>                 <chr>          <dbl>   <dbl> <dbl>      <dbl>   <dbl>     <dbl> <chr>  
##  1 Bristol               E06000023     443090    19.6  13.1       20.6    16.4      7.15 England
##  2 County Durham         E06000047     499685    20.8  14.3       34.0    16.2      7.43 England
##  3 Cornwall              E06000052     549610    20    13.8       25.0    11.5      7.57 England
##  4 Wiltshire             E06000054     489800    17    12.5       22.5    11.7      7.51 England
##  5 Buckinghamshire       E06000060     536030    17.7  11.9       21.4    10.5      7.39 England
##  6 West Northamptonshire E06000062     408895    18.1  10.1       29.0    NA        7.55 England
##  7 Manchester            E08000003     504475    29.1  12.2       26.2    16.8      7.12 England
##  8 Liverpool             E08000012     437490    25.4  15.1       29.0    17.8      7.12 England
##  9 Sheffield             E08000019     518885    19.1  12.9       25.4    13.3      7.13 England
## 10 Birmingham            E08000025    1056970    24.5  11.4       26.9    16.1      7.2  England
## 11 Bradford              E08000032     513785    20.8  10.4       27.6    15.4      7.29 England
## 12 Kirklees              E08000034     416280    18.3  11.9       27.4    13.6      7.2  England
## 13 Leeds                 E08000035     760265    17.9  11.2       23.8    12.1      7.28 England
# Since these records correspond to major cities such as Birmingham, Manchester, and Leeds, they are almost certainly valid: they simply reflect the large populations of metropolitan areas. Removing them would discard a significant portion of the data, so they are retained unchanged.

# Analysing other variables with outliers
summary(cvd_data$Poverty)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   12.90   16.90   18.70   19.34   21.40   30.70      76
summary(cvd_data$smokers)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    3.20   10.72   12.80   13.07   15.30   27.80       7
summary(cvd_data$wellbeing)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   6.610   7.280   7.410   7.423   7.560   8.170      15
# The summary statistics for poverty, smokers, and wellbeing confirm the presence of outliers. These outliers are retained for several reasons:

# 1. Poverty: Outliers may represent regions with genuine economic challenges, providing insights into areas needing intervention.
# 2. Smokers: Low values could indicate regions with specific cultural or policy influences on smoking rates, for example targeted anti-smoking campaigns, offering insights into health policy effectiveness.
# 3. Wellbeing: The distribution shows little variation, making the outliers less impactful on the overall analysis, so these can also be kept.

# Removing these outliers would risk losing valuable information about regional disparities and trends.
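The variable-by-variable inspection above can be generalised. A sketch (assuming the `cvd_data` tibble from earlier; the helper function is illustrative, not from the original analysis) that counts IQR-rule outliers for each numeric variable in one pass:

```r
# Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for a numeric vector
count_iqr_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}

sapply(
  cvd_data[, c("Population", "Poverty", "CVD", "overweight", "smokers", "wellbeing")],
  count_iqr_outliers
)
```

This mirrors the Population calculation above and quantifies how many points the boxplots flagged per variable.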

Data Cleaning

This section addresses missing values by excluding regions with complete missingness in the critical variables and then removing the few remaining incomplete rows within England, i.e. performing a complete case analysis. By doing so, we ensure that the dataset is reliable and ready for subsequent analyses.

# Exclude non-English regions
cvd_data_england <- cvd_data %>%
  filter(Region == "England")

# Verify exclusion
kable(
  table(cvd_data_england$Region),
  caption = "Number of Areas by Region After Exclusion"
)
Number of Areas by Region After Exclusion
Var1 Freq
England 320
# Count missing values in England data
missing_england_before <- cvd_data_england %>%
  dplyr::summarize(
    Missing_Population = sum(is.na(Population)),
    Missing_Poverty = sum(is.na(Poverty)),
    Missing_CVD = sum(is.na(CVD)),
    Missing_Overweight = sum(is.na(overweight)),
    Missing_smokers = sum(is.na(smokers)),
    Missing_wellbeing = sum(is.na(wellbeing))
  )

# View the summary
print(missing_england_before)
## # A tibble: 1 × 6
##   Missing_Population Missing_Poverty Missing_CVD Missing_Overweight Missing_smokers
##                <int>           <int>       <int>              <int>           <int>
## 1                 11              11          11                  7               7
## # ℹ 1 more variable: Missing_wellbeing <int>
# Remove rows with any missing critical variables
cvd_data_england_clean <- cvd_data_england %>%
  drop_na(Population, Poverty, CVD, overweight, smokers, wellbeing)

# Confirm removal
missing_england_after <- cvd_data_england_clean %>%
  dplyr::summarize(
    Missing_Population = sum(is.na(Population)),
    Missing_Poverty = sum(is.na(Poverty)),
    Missing_CVD = sum(is.na(CVD)),
    Missing_Overweight = sum(is.na(overweight)),
    Missing_smokers = sum(is.na(smokers)),
    Missing_wellbeing = sum(is.na(wellbeing))
  )

# View the summary
print(missing_england_after)
## # A tibble: 1 × 6
##   Missing_Population Missing_Poverty Missing_CVD Missing_Overweight Missing_smokers
##                <int>           <int>       <int>              <int>           <int>
## 1                  0               0           0                  0               0
## # ℹ 1 more variable: Missing_wellbeing <int>
# We now have a dataset with no missing data in all critical variables.

# Display impact on sample size
removed_obs <- nrow(cvd_data_england) - nrow(cvd_data_england_clean)
kable(
  data.frame(
    Description = c("Original Observations", "Final Observations", "Removed Cases"),
    Count = c(nrow(cvd_data_england), nrow(cvd_data_england_clean), removed_obs)
  ),
  caption = "Impact of Complete Case Analysis on Sample Size"
)
Impact of Complete Case Analysis on Sample Size
Description Count
Original Observations 320
Final Observations 303
Removed Cases 17
# Using the "cvd_data_england_clean" for all further analysis, focusing on performing complete case analysis on data from England.
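As a reproducibility safeguard, a minimal sanity check (hypothetical, not part of the original pipeline) can assert the cleaned dataset's expected properties before any modelling, so a rerun on updated data fails loudly rather than silently:

```r
# Fail fast if the cleaned dataset ever violates these assumptions.
# The row count of 303 matches the complete case analysis above.
stopifnot(
  nrow(cvd_data_england_clean) == 303,
  all(cvd_data_england_clean$Region == "England"),
  !anyNA(cvd_data_england_clean[, c("Population", "Poverty", "CVD",
                                    "overweight", "smokers", "wellbeing")])
)
```

If a future dataset has a different number of rows, the count assertion should be updated rather than removed.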

Visualisations

The following visualizations provide insights into the distribution and relationships of key health and socioeconomic variables across local authority regions in England. Through histograms, scatter plots with regression lines, and a correlation matrix, we explore patterns in population, poverty, cardiovascular disease (CVD) prevalence, overweight, smoking rates, and wellbeing.

# Select numerical variables
numerical_vars <- c("Population", "Poverty", "CVD", "overweight", "smokers", "wellbeing")

# Reshape data to long format for easy faceting
cvd_long <- cvd_data_england_clean %>%
  pivot_longer(cols = all_of(numerical_vars), names_to = "Variable", values_to = "Value")

# Create histograms with faceting
ggplot(cvd_long, aes(x = Value)) +
  geom_histogram(
    fill = "skyblue",
    color = "black",
    alpha = 0.7,
    bins = 30
  ) +
  facet_wrap(
    ~ Variable,
    scales = "free_x",
    strip.position = "bottom"
  ) +
  theme_minimal(base_size = 12) +
  labs(
    title = "Distribution of Key Variables: Histograms",
    subtitle = "Visualizing the frequency distribution for each variable",
    x = "Value",
    y = "Frequency"
  ) +
  theme(
    # Center-align titles and subtitles
    plot.title = element_text(face = "bold", hjust = 0.5, size = 14),
    plot.subtitle = element_text(hjust = 0.5, margin = margin(b = 10)),
    # Customize facet labels
    strip.text = element_text(face = "bold", size = 10, color = "black"),
    strip.background = element_rect(fill = "grey90", color = "grey80"),
    # Refine gridlines
    panel.grid.major = element_line(color = "grey85", linetype = "dashed"),
    panel.grid.minor = element_blank(),
    # Add padding to axes
    axis.title.x = element_text(margin = margin(t = 10)),
    axis.title.y = element_text(margin = margin(r = 10))
  )

# The histograms confirm the previously noted distributions of key variables. Population remains right-skewed, with a few regions having significantly larger populations. Poverty, CVD Prevalence, and Overweight show symmetric distributions, aligning with balanced socio-economic and health conditions. The slight right skew in Smokers and the narrow, roughly symmetric distribution of Wellbeing scores are also evident.

# Create Scatter Plot with Regression Line
p.scatter_poverty <- ggplot(cvd_data_england_clean, aes(x = Poverty, y = CVD)) +
  geom_point(color = "darkblue", alpha = 0.6, size = 2) +  # Scatter points
  geom_smooth(method = "lm", se = TRUE, color = "orange", linetype = "dashed") +  # Regression line with confidence interval
  theme_minimal() +  # Clean theme
  labs(
    title = "Effect of Poverty on Cardiovascular Disease Prevalence",
    subtitle = "Analysis of Local Authority Regions in England",
    x = "Proportion of Population Living in Poverty (%)",
    y = "Cardiovascular Disease Prevalence (%)",
    caption = "Data Source: UK Office for National Statistics"
  ) +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10)
  )

print(p.scatter_poverty)

# This scatter plot illustrates the relationship between poverty levels and cardiovascular disease (CVD) prevalence across local authority regions in England. Each point represents a region, with the x-axis showing the proportion of the population living in poverty and the y-axis indicating CVD prevalence. The orange dashed line represents a linear regression fit, suggesting a slight negative correlation between poverty and CVD prevalence. The shaded area around the line indicates the confidence interval, reflecting the uncertainty of the regression estimate.
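The apparent negative slope can be quantified with a formal correlation test on the cleaned England data; a brief sketch using base R's `cor.test`:

```r
# Pearson correlation between poverty and CVD prevalence,
# with a 95% confidence interval and p-value
cor.test(cvd_data_england_clean$Poverty, cvd_data_england_clean$CVD)
```

This gives a single-number summary of the bivariate relationship before the multiple regression below adjusts for the other predictors.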

# Select relevant numerical variables
variables <- c("overweight", "smokers", "wellbeing")

# Create scatter plots for each variable
scatter_plots <- lapply(variables, function(var) {
  ggplot(cvd_data_england_clean, aes(x = .data[[var]], y = CVD)) +  # aes_string is deprecated; use the .data pronoun
    geom_point(color = "darkblue", alpha = 0.6, size = 2) +
    geom_smooth(
      method = "lm",
      se = TRUE,
      color = "orange",
      linetype = "dashed",
      linewidth = 1
    ) +
    theme_minimal(base_size = 12) +
    labs(
      x = var,
      y = "CVD Prevalence (%)"
    ) +
    theme(
      # Customize axis labels
      axis.title.x = element_text(face = "bold", margin = margin(t = 10)),
      axis.title.y = element_text(face = "bold", margin = margin(r = 10)),
      # Add gridlines for clarity
      panel.grid.major = element_line(color = "grey85", linetype = "dashed"),
      panel.grid.minor = element_blank()
    )
})

# Combine plots with an overarching title
p.scatter_plots <- grid.arrange(
  grobs = scatter_plots,
  ncol = 2,
  top = textGrob(
    "Scatter Plots of CVD Prevalence vs Key Predictors",
    gp = gpar(fontsize = 16, fontface = "bold")
  )
)

# These scatter plots illustrate the relationship between cardiovascular disease (CVD) prevalence and key predictors: overweight, smokers, and wellbeing. Each plot shows individual data points for regions, with the x-axis representing the predictor variable and the y-axis showing CVD prevalence. The orange dashed lines represent linear regression fits, indicating potential trends. In the overweight and smokers plots, there is a slight positive correlation with CVD prevalence, suggesting higher rates may be associated with increased CVD. The wellbeing plot shows a weaker correlation, indicating a more complex relationship. The shaded areas around the lines represent confidence intervals, highlighting the uncertainty in the estimates.

# Compute the correlation matrix
corr_matrix <- cor(
  cvd_data_england_clean[, c("Poverty", "CVD", "overweight", "smokers", "wellbeing")],
  use = "complete.obs"
)

# Create the correlation plot
ggcorrplot(
  corr_matrix,
  lab = TRUE,              # Add correlation coefficients
  lab_size = 4,            # Adjust label size
  colors = c("#6D9EC1", "white", "#E46726"), # Custom color palette
  outline.color = "gray",  # Outline circles with gray
  ggtheme = theme_minimal() # Use a minimal theme
) +
  labs(
    title = "Correlation Matrix of Key Variables",
    subtitle = "Exploring relationships between health and socioeconomic factors",
    caption = "Note: Correlation coefficients range from -1 (negative) to +1 (positive)."
  ) +
  theme(
    # Center-align title and subtitle
    plot.title = element_text(face = "bold", hjust = 0.5, size = 14),
    plot.subtitle = element_text(hjust = 0.5, size = 10, margin = margin(b = 10)),
    plot.caption = element_text(size = 8, hjust = 0.5, margin = margin(t = 10))
  )

# The correlation matrix provides a visual summary of the relationships between key health and socioeconomic variables. Notable correlations include a moderate positive relationship between smokers and overweight, suggesting a potential link between smoking and weight issues. There is a negative correlation between wellbeing and poverty, indicating that higher poverty levels are associated with lower wellbeing. CVD shows a slight negative correlation with poverty, suggesting complex interactions that may require further investigation. No pairwise correlation is high, so severe multicollinearity is unlikely in the regression model below; variance inflation factors can confirm this once the model is fitted.

Statistical Analysis

This section explores the combined effects of overweight, smoking, wellbeing, and poverty on cardiovascular disease (CVD) prevalence using multiple linear regression. By examining both the statistical significance and confidence intervals of each predictor, we aim to understand their individual contributions to CVD prevalence. The analysis employs both Null Hypothesis Significance Testing (NHST) and estimation approaches to provide a comprehensive view of the data.

#Multiple Linear Regression
#We begin by examining the combined effect of overweight, smoking, wellbeing, and poverty on CVD prevalence.

# Fit a multiple linear regression model
m.cvd <- lm(CVD ~ overweight + smokers + wellbeing + Poverty, data = cvd_data_england_clean)

# Summarize the model
summary(m.cvd)
## 
## Call:
## lm(formula = CVD ~ overweight + smokers + wellbeing + Poverty, 
##     data = cvd_data_england_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3966 -1.4313 -0.1097  1.3905  4.7779 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.69780    3.93361  -0.432 0.666335    
## overweight   0.10985    0.02123   5.174 4.22e-07 ***
## smokers      0.12030    0.03366   3.574 0.000410 ***
## wellbeing    1.80025    0.49096   3.667 0.000291 ***
## Poverty     -0.18400    0.03515  -5.234 3.13e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.894 on 298 degrees of freedom
## Multiple R-squared:  0.2494, Adjusted R-squared:  0.2393 
## F-statistic: 24.75 on 4 and 298 DF,  p-value: < 2.2e-16
# Multiple regression indicates a significant positive effect of overweight on CVD prevalence (t(298) = 5.17, p < 0.001), with each 1% increase in overweight predicting a 0.11% rise in CVD prevalence (CI = [0.07, 0.15]).
# Smoking also shows a positive effect (t(298) = 3.57, p < 0.001), with each 1% increase predicting a 0.12% rise (CI = [0.06, 0.18]).
# Wellbeing has a positive effect (t(298) = 3.67, p < 0.001), with each unit increase predicting a 1.80% rise (CI = [0.84, 2.76]).
# Poverty shows a negative effect (t(298) = -5.23, p < 0.001), with each 1% increase predicting a 0.18% decrease in CVD prevalence (CI = [-0.25, -0.11]).

# NHST Approach:
# All predictors are statistically significant (p < 0.001), suggesting strong evidence against the null hypothesis of no effect.
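Before leaning on these p-values, the usual linear-model assumptions (linearity, constant variance, approximately normal residuals) are worth a visual check. A brief diagnostic sketch on the fitted model, using base R graphics:

```r
# Standard diagnostic plots for the fitted model:
# residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(m.cvd)
par(mfrow = c(1, 1))
```

Roughly patternless residuals and a near-straight Q-Q line would support the inference reported above.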

# Get confidence intervals for estimation
conf_intervals <- confint(m.cvd)

# Tidy the model output and add confidence intervals
tidy_mcvd_model <- tidy(m.cvd) %>%
  mutate(
    `Lower 95% CI` = conf_intervals[, 1],
    `Upper 95% CI` = conf_intervals[, 2]
  ) %>%
  select(term, estimate, std.error, statistic, p.value, `Lower 95% CI`, `Upper 95% CI`)

# Create a professional-style table
kable(tidy_mcvd_model, format = "html", digits = 3, caption = "Confidence Intervals for Multiple Regression Coefficients") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  footnote(general = "Note: CI = Confidence Interval. Significance codes: *** p < 0.001, ** p < 0.01, * p < 0.05")
Confidence Intervals for Multiple Regression Coefficients
term estimate std.error statistic p.value Lower 95% CI Upper 95% CI
(Intercept) -1.698 3.934 -0.432 0.666 -9.439 6.043
overweight 0.110 0.021 5.174 0.000 0.068 0.152
smokers 0.120 0.034 3.574 0.000 0.054 0.187
wellbeing 1.800 0.491 3.667 0.000 0.834 2.766
Poverty -0.184 0.035 -5.234 0.000 -0.253 -0.115
Note:
Note: CI = Confidence Interval. Significance codes: *** p < 0.001, ** p < 0.01, * p < 0.05
# Estimation Approach:
# CI for overweight ([0.07, 0.15]) suggests that the true effect of overweight on CVD prevalence is likely between 0.07% and 0.15% per 1% increase.
# CI for smokers ([0.054, 0.187]) indicates the true effect on CVD prevalence is likely between 0.054% and 0.187% per 1% increase.
# CI for wellbeing ([0.834, 2.766]) suggests the true effect on CVD prevalence is likely between 0.834% and 2.766% per unit increase.
# CI for poverty ([-0.253, -0.115]) indicates the true effect on CVD prevalence is likely between -0.253% and -0.115% per 1% increase.
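To back up the earlier observation that pairwise correlations were modest, variance inflation factors from the already-loaded `car` package give a direct multicollinearity check on the fitted model:

```r
# VIF values well below 5 (some use 10 as the cutoff) indicate that
# multicollinearity is not distorting the coefficient estimates
car::vif(m.cvd)
```

This complements the correlation matrix: VIF also catches collinearity that only appears when several predictors combine.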

Section 2

Report on Cardiovascular Disease (CVD) Prevalence in England

This report presents findings from a comprehensive statistical analysis of CVD prevalence across various local authority regions in England. The analysis aims to identify significant predictors of CVD rates, providing insights to guide public health strategies.

Data Collection and Preparation

Data was sourced from the UK Office for National Statistics, focusing on key health and socioeconomic variables. Initial processing addressed missing values and ensured data integrity, establishing a reliable foundation for analysis.

Visual Analysis

Boxplots were used to explore the spread of CVD prevalence and its predictors across local authorities. These plots highlighted the variability in each measure and flagged areas with extreme values (outliers) for further analysis.

Scatter plots demonstrated relationships between key predictors—such as smoking prevalence, obesity rates, and poverty—and CVD prevalence. Regression lines overlaid on these plots highlighted clear trends, including a positive correlation between smoking/overweight rates and CVD prevalence, and unexpected trends in socioeconomic factors.

Multiple Linear Regression Analysis

The analysis employs multiple linear regression to assess the impact of various predictors on CVD prevalence:

Overweight Population (%): A significant positive association was observed, with each 1% increase in overweight individuals predicting a 0.11% rise in CVD prevalence (95% CI: [0.07, 0.15]). This highlights obesity as a critical modifiable risk factor.

Smoking Prevalence (%): Smoking was another significant predictor, with a 1% increase predicting a 0.12% rise in CVD prevalence (95% CI: [0.054, 0.187]). This underscores the importance of anti-smoking campaigns.

Wellbeing Score: Surprisingly, higher wellbeing scores were associated with increased CVD prevalence, with each unit increase predicting a 1.8% rise (95% CI: [0.834, 2.766]). This unexpected finding suggests the need for further research into confounding factors such as stress or lifestyle components.

Poverty Rate (%): A 1% increase in poverty rate was associated with a 0.18% decrease in CVD prevalence (95% CI: [-0.253, -0.115]). This counterintuitive relationship may reflect disparities in healthcare access, reporting biases, or other complex socioeconomic factors.

Key Recommendations

1. Address Lifestyle Factors Implement targeted interventions to reduce smoking and obesity rates. Public health campaigns and community-based fitness initiatives, especially in schools and workplaces, can help lower CVD prevalence.

2. Investigate Wellbeing Paradox Further research is needed to understand the positive association between wellbeing and CVD prevalence. Investigations should examine confounding variables such as mental health and stress.

3. Consider Socioeconomic Disparities Develop strategies tailored to low-income populations, addressing barriers to healthcare and preventive measures. Policies that improve healthcare access and promote healthy behaviors in economically disadvantaged regions can mitigate risks.

Conclusion

The study highlights the multifaceted drivers of CVD prevalence in England, including modifiable lifestyle factors (e.g., obesity and smoking) and socioeconomic variables. By focusing on these determinants, public health strategies can be fine-tuned to reduce CVD prevalence and enhance health outcomes. Collaborative efforts involving policymakers, healthcare providers, and communities are essential to effectively implement these evidence-based interventions.


QUESTION 2: Customer Satisfaction

Section 1

Introduction to Analysis

This report examines factors influencing customer satisfaction across various store locations of a furniture retail company. Utilizing data collected by the company, we investigate how staff job satisfaction, delivery times for large and custom items, availability of new product ranges, and the store’s socio-economic status (SES) impact overall customer satisfaction scores. The objective is to identify significant predictors and assess whether the effect of delivery times varies across different SES categories.

The analysis includes:

1. Data Integrity Checks: Ensuring the dataset’s reliability by identifying and addressing missing or inconsistent data.
2. Visualizations: Creating graphs to illustrate the distribution of customer satisfaction scores and explore relationships with influencing factors.
3. Multiple Linear Regression with Interaction Effects: Assessing the impact of staff satisfaction, delivery times, new product ranges, and SES on customer satisfaction, including interactions between delivery times and SES.
4. Interpretation of Results: Translating statistical findings into actionable recommendations for enhancing customer satisfaction.

The report is designed to be reproducible, featuring well-documented and organized code that allows future analysts to adapt the analysis to new or updated datasets effortlessly.


Data Dictionary

The data dictionary provides an overview of the variables included in the dataset used for the analysis. It outlines each variable’s name, data type, and a brief description of its purpose or content.

# Create the data dictionary as a data frame
data_dictionary <- data.frame(
  Column_Name = c("SES_category", "customer.satisfaction", "staff.satisfaction", "delivery.time", "new_range"),
  Data_Type = c("Character", "Numeric", "Numeric", "Numeric", "Logical"),
  Description = c(
    "Socio-economic status category of the store area (Low, Medium, High).",
    "Average customer satisfaction score measured on a scale from 1 to 10.",
    "Average staff job satisfaction score measured on a scale from 1 to 10.",
    "Average delivery time for large and custom items in hours.",
    "Indicator of whether the store carries a new range of products (TRUE/FALSE)."
  )
)

# Render the data dictionary as a table
kable(data_dictionary, format = "html", caption = "Data Dictionary for Customer Satisfaction Analysis") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Data Dictionary for Customer Satisfaction Analysis
Column_Name Data_Type Description
SES_category Character Socio-economic status category of the store area (Low, Medium, High).
customer.satisfaction Numeric Average customer satisfaction score measured on a scale from 1 to 10.
staff.satisfaction Numeric Average staff job satisfaction score measured on a scale from 1 to 10.
delivery.time Numeric Average delivery time for large and custom items in hours.
new_range Logical Indicator of whether the store carries a new range of products (TRUE/FALSE).

Data Loading, Integrity Check and Cleaning

# Load the dataset
customer_data <- read_csv("cust_satisfaction.csv")

# Display the first few rows
head(customer_data)
## # A tibble: 6 × 5
##   SES_category customer.satisfaction staff.satisfaction delivery.time new_range
##   <chr>                        <dbl>              <dbl>         <dbl> <lgl>    
## 1 Medium                        7.27               6.88          66.7 FALSE    
## 2 Medium                        7.93               7.44          68.2 FALSE    
## 3 Medium                        7.12               7.15          70.7 FALSE    
## 4 High                          6.35               6.47          61.4 TRUE     
## 5 High                          6.78               7.06          57.7 TRUE     
## 6 Medium                        8.07               7.03          38.0 TRUE
# Check the structure of the dataset
str(customer_data)
## spc_tbl_ [300 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ SES_category         : chr [1:300] "Medium" "Medium" "Medium" "High" ...
##  $ customer.satisfaction: num [1:300] 7.27 7.93 7.12 6.35 6.78 ...
##  $ staff.satisfaction   : num [1:300] 6.88 7.44 7.15 6.47 7.06 ...
##  $ delivery.time        : num [1:300] 66.7 68.2 70.7 61.4 57.7 ...
##  $ new_range            : logi [1:300] FALSE FALSE FALSE TRUE TRUE TRUE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   SES_category = col_character(),
##   ..   customer.satisfaction = col_double(),
##   ..   staff.satisfaction = col_double(),
##   ..   delivery.time = col_double(),
##   ..   new_range = col_logical()
##   .. )
##  - attr(*, "problems")=<externalptr>
# The `SES_category` variable is currently of type character and should be converted to a factor with levels "Low", "Medium", and "High" to facilitate categorical analysis and interaction modeling. All other variables (`customer.satisfaction`, `staff.satisfaction`, `delivery.time`, `new_range`) are correctly formatted as numeric or logical types, ensuring they are ready for regression analysis.

# Convert SES_category to an ordered factor
customer_data$SES_category <- factor(customer_data$SES_category, 
                                     levels = c("Low", "Medium", "High"), 
                                     ordered = TRUE)

# Verify the change
str(customer_data)
## spc_tbl_ [300 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ SES_category         : Ord.factor w/ 3 levels "Low"<"Medium"<..: 2 2 2 3 3 2 2 2 2 3 ...
##  $ customer.satisfaction: num [1:300] 7.27 7.93 7.12 6.35 6.78 ...
##  $ staff.satisfaction   : num [1:300] 6.88 7.44 7.15 6.47 7.06 ...
##  $ delivery.time        : num [1:300] 66.7 68.2 70.7 61.4 57.7 ...
##  $ new_range            : logi [1:300] FALSE FALSE FALSE TRUE TRUE TRUE ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   SES_category = col_character(),
##   ..   customer.satisfaction = col_double(),
##   ..   staff.satisfaction = col_double(),
##   ..   delivery.time = col_double(),
##   ..   new_range = col_logical()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Inspect the data
summary(customer_data)
##  SES_category customer.satisfaction staff.satisfaction delivery.time   new_range      
##  Low   :100   Min.   :3.760         Min.   :4.850      Min.   :32.96   Mode :logical  
##  Medium:100   1st Qu.:6.099         1st Qu.:6.216      1st Qu.:52.46   FALSE:142      
##  High  :100   Median :7.006         Median :6.634      Median :60.45   TRUE :158      
##               Mean   :6.930         Mean   :6.751      Mean   :59.60                  
##               3rd Qu.:7.865         3rd Qu.:7.274      3rd Qu.:66.79                  
##               Max.   :9.670         Max.   :8.860      Max.   :92.48
# Summary Insights:
# SES_category is balanced, with 100 stores in each of the Low, Medium, and High levels. Both customer.satisfaction (mean: 6.93) and staff.satisfaction (mean: 6.75) show high average ratings; customer satisfaction has a slightly wider range and a mild left skew (mean below median). delivery.time ranges from 32.96 to 92.48 with a median of 60.45, with a long upper tail suggesting occasional delays in some stores.
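A quick mean-versus-median comparison supports these skewness remarks; a short check on the loaded data:

```r
# Compare mean and median for each numeric variable:
# mean below median hints at a left skew, mean above median at a right skew
customer_data %>%
  summarize(across(
    c(customer.satisfaction, staff.satisfaction, delivery.time),
    list(mean = mean, median = median)
  ))
```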

# Check for missing values in each column
missing_summary <- customer_data %>%
  summarize_all(~ sum(is.na(.))) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count")

# Display missing values as a table
kable(
  missing_summary, 
  caption = "Missing Values per Variable"
) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Missing Values per Variable
Variable Missing_Count
SES_category 0
customer.satisfaction 0
staff.satisfaction 0
delivery.time 0
new_range 0
# The dataset has been examined for missing values across all variables. No missing values were found, ensuring complete data for accurate analysis. 

# Check for duplicate rows in the dataset
duplicate_rows <- customer_data %>%
  filter(duplicated(.))

# Count the number of duplicate rows
duplicate_count <- nrow(duplicate_rows)

# Display the number of duplicate rows
kable(
  data.frame(
    Description = "Number of Duplicate Rows",
    Count = duplicate_count
  ),
  caption = "Duplicate Rows Count"
) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Duplicate Rows Count
Description Count
Number of Duplicate Rows 0
# No duplicate rows were found, so we proceed with the full dataset.

# Select numerical variables to check for outliers
numerical_vars <- c("customer.satisfaction", "staff.satisfaction", "delivery.time")

# Reshape data to long format for faceted boxplots
customer_long <- customer_data %>%
  pivot_longer(cols = all_of(numerical_vars), names_to = "Variable", values_to = "Value")

# Create boxplots for each numerical variable
ggplot(customer_long, aes(x = "", y = Value, fill = Variable)) +
  geom_boxplot(outlier.color = "red", outlier.shape = 16, alpha = 0.7) +
  facet_wrap(~ Variable, scales = "free_y") +
  theme_minimal(base_size = 14) +
  labs(
    title = "Boxplots of Key Numerical Variables",
    x = NULL,
    y = "Value"
  ) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5, size = 16),
    axis.title.y = element_text(face = "bold"),
    axis.text.x = element_blank(),  # Remove x-axis text
    axis.ticks.x = element_blank(),  # Remove x-axis ticks
    strip.text = element_text(face = "bold", size = 12),
    legend.position = "none"
  ) +
  scale_fill_brewer(palette = "Set2")

# Outlier Analysis:
# The boxplots reveal outliers in the delivery time variable, indicated by red points above the main distribution. These outliers suggest occasional delays that exceed typical delivery times and may require further investigation. No outliers are observed in customer satisfaction or staff satisfaction, indicating consistent performance in these areas.

# Calculate the IQR for delivery time
Q1 <- quantile(customer_data$delivery.time, 0.25)
Q3 <- quantile(customer_data$delivery.time, 0.75)
IQR <- Q3 - Q1

# Define outlier thresholds
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR

# Identify outliers
outliers <- customer_data %>%
  filter(delivery.time < lower_bound | delivery.time > upper_bound)

# View outlier records
print(outliers)
## # A tibble: 5 × 5
##   SES_category customer.satisfaction staff.satisfaction delivery.time new_range
##   <ord>                        <dbl>              <dbl>         <dbl> <lgl>    
## 1 Low                           8.26               6.59          89.4 TRUE     
## 2 High                          5.64               6.24          92.5 TRUE     
## 3 Low                           5.89               5.10          90.3 FALSE    
## 4 Medium                        7.01               7.13          92.5 TRUE     
## 5 Low                           4.94               6.77          92.5 FALSE
# Delivery time contains five outliers on the high end; their effect on the analysis is evaluated below before deciding how to treat them.

# Separate data into two sets
data_with_outliers <- customer_data  

data_without_outliers <- customer_data %>%
  filter(delivery.time >= lower_bound & delivery.time <= upper_bound)

# View the datasets
print("Data with Outliers (All Records):")
## [1] "Data with Outliers (All Records):"
print(data_with_outliers)
## # A tibble: 300 × 5
##    SES_category customer.satisfaction staff.satisfaction delivery.time new_range
##    <ord>                        <dbl>              <dbl>         <dbl> <lgl>    
##  1 Medium                        7.27               6.88          66.7 FALSE    
##  2 Medium                        7.93               7.44          68.2 FALSE    
##  3 Medium                        7.12               7.15          70.7 FALSE    
##  4 High                          6.35               6.47          61.4 TRUE     
##  5 High                          6.78               7.06          57.7 TRUE     
##  6 Medium                        8.07               7.03          38.0 TRUE     
##  7 Medium                        9.00               7.35          52.8 TRUE     
##  8 Medium                        7.53               7.77          68.8 FALSE    
##  9 Medium                        7.42               7.52          79.6 FALSE    
## 10 High                          7.45               6.56          57.7 TRUE     
## # ℹ 290 more rows
print("Data without Outliers:")
## [1] "Data without Outliers:"
print(data_without_outliers)
## # A tibble: 295 × 5
##    SES_category customer.satisfaction staff.satisfaction delivery.time new_range
##    <ord>                        <dbl>              <dbl>         <dbl> <lgl>    
##  1 Medium                        7.27               6.88          66.7 FALSE    
##  2 Medium                        7.93               7.44          68.2 FALSE    
##  3 Medium                        7.12               7.15          70.7 FALSE    
##  4 High                          6.35               6.47          61.4 TRUE     
##  5 High                          6.78               7.06          57.7 TRUE     
##  6 Medium                        8.07               7.03          38.0 TRUE     
##  7 Medium                        9.00               7.35          52.8 TRUE     
##  8 Medium                        7.53               7.77          68.8 FALSE    
##  9 Medium                        7.42               7.52          79.6 FALSE    
## 10 High                          7.45               6.56          57.7 TRUE     
## # ℹ 285 more rows
# Evaluating the effect of outliers using correlation matrix:

# Select only numeric columns
numeric_cols_with_outliers <- select(data_with_outliers, where(is.numeric))
numeric_cols_without_outliers <- select(data_without_outliers, where(is.numeric))

# Build correlation matrices
cor_matrix_with_outliers <- cor(numeric_cols_with_outliers, use = "complete.obs")
cor_matrix_without_outliers <- cor(numeric_cols_without_outliers, use = "complete.obs")

# Convert matrices to long format
cor_with_outliers_long <- melt(cor_matrix_with_outliers)
cor_without_outliers_long <- melt(cor_matrix_without_outliers)

# Add a column to distinguish the datasets
cor_with_outliers_long$Dataset <- "With Outliers"
cor_without_outliers_long$Dataset <- "Without Outliers"

# Combine the datasets
cor_combined <- rbind(cor_with_outliers_long, cor_without_outliers_long)

# Plot using ggplot2 with facet_wrap
ggplot(cor_combined, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white", alpha= 0.8) +
  geom_text(aes(label = round(value, 2)), color = "black", size = 3) +  
  facet_wrap(~ Dataset) +
  scale_fill_gradient2(low = "lightblue", high = "darkred", mid = "white", midpoint = 0, limit = c(-1, 1)) +
  theme_minimal() +
  labs(
    title = "Comparison of Correlation Matrices",
    x = NULL,
    y = NULL,
    fill = "Correlation"
  ) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold", hjust = 0.5)
  )

# The comparison of correlation matrices shows minimal differences between datasets with and without outliers. Given the similarity, we will retain the outliers and proceed with the original dataset for further analysis.
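The visual similarity can be backed with a single summary number, using the two matrices built above:

```r
# Largest absolute change in any pairwise correlation when outliers are removed;
# a small value supports retaining the outliers
max(abs(cor_matrix_with_outliers - cor_matrix_without_outliers))
```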

Visualisations

This section presents a series of visualizations to explore key metrics related to customer satisfaction in a furniture retail company. By examining the distribution and relationships of variables such as delivery time, staff satisfaction, and customer satisfaction, we aim to uncover patterns and insights that can inform strategic decisions. The visualizations include faceted histograms and scatter plots, providing a comprehensive view of the data’s spread, frequency, and correlations. These insights are crucial for understanding the factors that influence customer satisfaction and identifying areas for improvement.

# Reshape the data for faceting
customer_data_long <- customer_data %>%
  select(customer.satisfaction, staff.satisfaction, delivery.time) %>%
  pivot_longer(
    cols = everything(),
    names_to = "Variable",
    values_to = "Value"
  )

# Create a faceted histogram
ggplot(customer_data_long, aes(x = Value)) +
  geom_histogram(
    fill = "skyblue", color = "black", bins = 30, alpha = 0.8
  ) +
  facet_wrap(~ Variable, scales = "free", labeller = as_labeller(c(
    "customer.satisfaction" = "Customer Satisfaction (1-10)",
    "staff.satisfaction" = "Staff Satisfaction (1-10)",
    "delivery.time" = "Delivery Time (Hours)"
  ))) +
  theme_minimal() +
  labs(
    title = "Distribution of Key Variables",
    subtitle = "Faceted histograms showing the spread and frequency of key metrics",
    x = "Value",
    y = "Frequency"
  ) +
  theme(
    plot.title = element_text(size = 18, face = "bold"),
    plot.subtitle = element_text(size = 14),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10),
    strip.text = element_text(size = 12, face = "bold")
  )

# The histograms reveal that customer satisfaction and delivery time exhibit nearly normal distributions, while staff satisfaction is slightly right-skewed. Delivery time shows a few outliers, potentially reflecting occasional service delays. Most satisfaction ratings are concentrated in the 6-8 range, indicating generally positive feedback.

# Reshape data to long format for faceting
customer_long <- customer_data %>%
  pivot_longer(cols = c(delivery.time, staff.satisfaction), names_to = "Variable", values_to = "Value")

# Create faceted scatter plots
p.variables.scatterplots <- ggplot(customer_long, aes(x = Value, y = customer.satisfaction)) +
  geom_point(alpha = 0.6, color = "dodgerblue", size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "darkred", linetype = "dashed", size = 1) +
  facet_wrap(~ Variable, scales = "free_x") +
  labs(
    title = "Effect of Variables on Customer Satisfaction",
    x = "Value",
    y = "Customer Satisfaction"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5, size = 16),
    axis.title = element_text(face = "bold"),
    axis.text = element_text(size = 12),
    strip.text = element_text(face = "bold", size = 12),
    panel.grid.major = element_line(color = "gray85", linetype = "dashed")
  )

p.variables.scatterplots

# The scatter plots illustrate the effect of delivery time and staff satisfaction on customer satisfaction.
# Delivery Time: Shows a negative correlation with customer satisfaction, indicating that longer delivery times are associated with lower satisfaction.
# Staff Satisfaction: Displays a positive correlation, suggesting that higher staff satisfaction is linked to higher customer satisfaction.
# These insights highlight the importance of efficient delivery and staff morale in enhancing customer satisfaction.
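These visual impressions can be checked with formal correlation tests; a short sketch on the same data:

```r
# Pearson correlation tests for the two relationships shown in the scatter plots
cor.test(customer_data$delivery.time, customer_data$customer.satisfaction)
cor.test(customer_data$staff.satisfaction, customer_data$customer.satisfaction)
```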

# Convert SES to a factor
customer_data$SES_category <- factor(customer_data$SES_category, levels = c("Low", "Medium", "High"))

# Create an interaction plot
p.delivery.ses.interaction <- ggplot(customer_data, aes(x = delivery.time, y = customer.satisfaction, color = SES_category)) +
  geom_point(alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", size = 1) +
  labs(
    title = "Effect of Delivery Time on Customer Satisfaction Across SES Levels",
    x = "Delivery Time",
    y = "Customer Satisfaction",
    color = "SES"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5, size = 16),
    axis.title = element_text(face = "bold"),
    axis.text = element_text(size = 12),
    legend.title = element_text(face = "bold"),
    legend.position = "top"
  ) +
  scale_color_manual(values = c("Low" = "#1b9e77", "Medium" = "#d95f02", "High" = "#7570b3"))

p.delivery.ses.interaction

# The plot illustrates the effect of delivery time on customer satisfaction across different SES levels.
# Low SES: Shows a slight negative trend, indicating that longer delivery times are associated with lower satisfaction.
# Medium SES: Displays a moderate negative trend, suggesting a stronger impact of delivery time on satisfaction compared to low SES.
# High SES: Exhibits the steepest negative trend, indicating that delivery time has the strongest association with customer satisfaction in high SES stores.
# Conclusion: The effect of delivery time on customer satisfaction appears to vary by SES, with higher SES levels showing a more pronounced negative slope. This highlights the importance of efficient delivery, especially in high SES areas.
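One way to quantify the slopes the plot suggests is to fit a simple regression within each SES group; a sketch (exploratory only, not a substitute for a formal interaction model):

```r
# Exploratory per-group slopes: simple lm within each SES level,
# keeping only the delivery.time coefficient (broom is already loaded)
customer_data %>%
  group_by(SES_category) %>%
  group_modify(~ tidy(lm(customer.satisfaction ~ delivery.time, data = .x))) %>%
  filter(term == "delivery.time")
```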

Statistical Analysis

This section explores the impact of delivery time, staff satisfaction, new product range, and SES on customer satisfaction using multiple linear regression, ANOVA, emmeans, and contrasts. We employ both null hypothesis significance testing (NHST) and estimation approaches to provide a robust analysis.

# Fit main effects model
main_effects_model <- lm(customer.satisfaction ~ delivery.time + staff.satisfaction + new_range + SES_category, data = customer_data)

# Summarize the model
summary(main_effects_model)
## 
## Call:
## lm(formula = customer.satisfaction ~ delivery.time + staff.satisfaction + 
##     new_range + SES_category, data = customer_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.59866 -0.67952 -0.01176  0.65469  2.89231 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.536176   0.616778   8.976  < 2e-16 ***
## delivery.time      -0.017220   0.004948  -3.480 0.000577 ***
## staff.satisfaction  0.351113   0.080457   4.364 1.77e-05 ***
## new_rangeTRUE       0.093878   0.112249   0.836 0.403643    
## SES_category.L      0.180853   0.097963   1.846 0.065878 .  
## SES_category.Q     -1.091799   0.110226  -9.905  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9687 on 294 degrees of freedom
## Multiple R-squared:  0.447,  Adjusted R-squared:  0.4376 
## F-statistic: 47.53 on 5 and 294 DF,  p-value: < 2.2e-16
# The model shows that delivery time negatively impacts customer satisfaction, while staff satisfaction has a positive effect. SES has a significant quadratic effect, indicating non-linear influences on satisfaction. New product range shows no significant impact.
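To complement the NHST summary with an estimation view, the coefficient confidence intervals can be extracted directly:

```r
# 95% confidence intervals for the main-effects model coefficients
confint(main_effects_model, level = 0.95)
```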

# Fit interaction model
interaction_model <- lm(customer.satisfaction ~ delivery.time * SES_category + staff.satisfaction + new_range, data = customer_data)

# Summarize the interaction model
summary(interaction_model)
## 
## Call:
## lm(formula = customer.satisfaction ~ delivery.time * SES_category + 
##     staff.satisfaction + new_range, data = customer_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.69336 -0.63905 -0.02235  0.64181  2.70571 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   5.551345   0.611777   9.074  < 2e-16 ***
## delivery.time                -0.018720   0.004937  -3.792 0.000182 ***
## SES_category.L                1.573477   0.530916   2.964 0.003290 ** 
## SES_category.Q               -0.907512   0.505916  -1.794 0.073880 .  
## staff.satisfaction            0.359605   0.079847   4.504 9.67e-06 ***
## new_rangeTRUE                 0.104057   0.111400   0.934 0.351034    
## delivery.time:SES_category.L -0.022954   0.008596  -2.670 0.007999 ** 
## delivery.time:SES_category.Q -0.003307   0.008502  -0.389 0.697576    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9603 on 292 degrees of freedom
## Multiple R-squared:  0.4602, Adjusted R-squared:  0.4473 
## F-statistic: 35.57 on 7 and 292 DF,  p-value: < 2.2e-16
# Calculate VIF for each predictor
vif_values <- vif(interaction_model)

# Display VIF values
print(vif_values)
##                                  GVIF Df GVIF^(1/(2*Df))
## delivery.time                1.055578  1        1.027413
## SES_category               845.639800  2        5.392577
## staff.satisfaction           1.277695  1        1.130352
## new_range                    1.006499  1        1.003244
## delivery.time:SES_category 843.885625  2        5.389778
# The high GVIF values for SES_category and the delivery.time:SES_category interaction are structural: an interaction term is strongly correlated with its component variables. Centering delivery time at its mean removes most of this collinearity.

# Center delivery time
customer_data$delivery.time_centered <- scale(customer_data$delivery.time, center = TRUE, scale = FALSE)

# Refit the model with centered variables
model_centered <- lm(customer.satisfaction ~ delivery.time_centered * SES_category + staff.satisfaction + new_range, data = customer_data)

# Check VIF again
vif_values_centered <- vif(model_centered)
print(vif_values_centered)
##                                         GVIF Df GVIF^(1/(2*Df))
## delivery.time_centered              1.055578  1        1.027413
## SES_category                        1.358172  2        1.079540
## staff.satisfaction                  1.277695  1        1.130352
## new_range                           1.006499  1        1.003244
## delivery.time_centered:SES_category 1.057980  2        1.014190
summary(model_centered)
## 
## Call:
## lm(formula = customer.satisfaction ~ delivery.time_centered * 
##     SES_category + staff.satisfaction + new_range, data = customer_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.69336 -0.63905 -0.02235  0.64181  2.70571 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            4.435698   0.543538   8.161 1.00e-14 ***
## delivery.time_centered                -0.018720   0.004937  -3.792 0.000182 ***
## SES_category.L                         0.205445   0.097814   2.100 0.036555 *  
## SES_category.Q                        -1.104618   0.110054 -10.037  < 2e-16 ***
## staff.satisfaction                     0.359605   0.079847   4.504 9.67e-06 ***
## new_rangeTRUE                          0.104057   0.111400   0.934 0.351034    
## delivery.time_centered:SES_category.L -0.022954   0.008596  -2.670 0.007999 ** 
## delivery.time_centered:SES_category.Q -0.003307   0.008502  -0.389 0.697576    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9603 on 292 degrees of freedom
## Multiple R-squared:  0.4602, Adjusted R-squared:  0.4473 
## F-statistic: 35.57 on 7 and 292 DF,  p-value: < 2.2e-16
# Create a tidy data frame from the summary
tidy_model <- tidy(model_centered) %>%
  mutate(
    `Significance` = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      TRUE            ~ ""
    )
  ) %>%
  rename(
    `Coefficient` = estimate,
    `Std. Error` = std.error,
    `t-Statistic` = statistic,
    `P-Value` = p.value
  ) %>%
  select(term, Coefficient, `Std. Error`, `t-Statistic`, `P-Value`, Significance)

# Create and format the table
kable(tidy_model, format = "html", digits = 3, caption = "Regression Model Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  footnote(
    general = "Significance codes: *** p < 0.001, ** p < 0.01, * p < 0.05"
  )
Regression Model Summary
term Coefficient Std. Error t-Statistic P-Value Significance
(Intercept) 4.436 0.544 8.161 0.000 ***
delivery.time_centered -0.019 0.005 -3.792 0.000 ***
SES_category.L 0.205 0.098 2.100 0.037 *
SES_category.Q -1.105 0.110 -10.037 0.000 ***
staff.satisfaction 0.360 0.080 4.504 0.000 ***
new_rangeTRUE 0.104 0.111 0.934 0.351
delivery.time_centered:SES_category.L -0.023 0.009 -2.670 0.008 **
delivery.time_centered:SES_category.Q -0.003 0.009 -0.389 0.698
Note:
Significance codes: *** p < 0.001, ** p < 0.01, * p < 0.05
# The centered interaction model shows that delivery time has a negative effect (t(292) = -3.79, p < 0.001), with each unit increase predicting a 0.019 decrease in customer satisfaction (CI = [-0.028, -0.009]).
# SES (Linear) positively affects satisfaction (t(292) = 2.10, p < 0.05), indicating higher satisfaction in higher SES levels (CI = [0.01, 0.40]).
# Staff satisfaction remains a strong positive predictor (t(292) = 4.50, p < 0.001), with each unit increase predicting a 0.36 rise in satisfaction (CI = [0.21, 0.51]).
# The new product range shows no significant effect (t(292) = 0.93, p = 0.35), with a small estimated increase in satisfaction (CI = [-0.12, 0.33]).
# The interaction between delivery time and SES (Linear) is significant (t(292) = -2.67, p < 0.01), indicating a stronger negative impact of delivery time in higher SES levels (CI = [-0.040, -0.006]).
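The confidence intervals quoted in the comments above come from the fitted centered model and can be reproduced with:

```r
# Reproduce the 95% CIs for the centered interaction model
confint(model_centered, level = 0.95)
```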

# Compare models using ANOVA
anova_results <- anova(main_effects_model, model_centered)

# Tidy the ANOVA results
tidy_anova <- as.data.frame(anova_results) %>%
  mutate(
    `Significance` = case_when(
      `Pr(>F)` < 0.001 ~ "***",
      `Pr(>F)` < 0.01  ~ "**",
      `Pr(>F)` < 0.05  ~ "*",
      TRUE             ~ ""
    )
  ) %>%
  tibble::rownames_to_column("Model") %>%
  select(Model, Res.Df, RSS, Df, `Sum of Sq`, F, `Pr(>F)`, Significance)

# Create and format the table
kable(tidy_anova, format = "html", digits = 3, caption = "ANOVA Comparison of Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  footnote(
    general = "Significance codes: *** p < 0.001, ** p < 0.01, * p < 0.05"
  )
ANOVA Comparison of Models
Model Res.Df RSS Df Sum of Sq F Pr(>F) Significance
1 294 275.860 NA NA NA NA
2 292 269.257 2 6.603 3.58 0.029 *
Note:
Significance codes: *** p < 0.001, ** p < 0.01, * p < 0.05
# ANOVA results show that adding interaction effects between centered delivery time and SES significantly improves the model fit (F = 3.58, p = 0.029). This highlights the importance of SES in moderating the impact of delivery time on customer satisfaction.

# Set up contrasts for SES
contrasts(customer_data$SES_category) <- contr.treatment(levels(customer_data$SES_category))
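For reference, treatment coding compares each level against the first (baseline) level. The coding matrix for three SES levels can be inspected directly; note that reassigning `contrasts()` on a factor only affects models fitted afterwards, not models that are already fitted:

```r
# Treatment coding for three SES levels: "Low" is the baseline,
# and each column is a 0/1 indicator for one non-baseline level.
ses_contrasts <- contr.treatment(c("Low", "Medium", "High"))
ses_contrasts
```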

# Use emmeans to test contrasts
contrast_results <- emmeans(interaction_model, pairwise ~ SES_category | delivery.time)

# Display contrast results
summary(contrast_results)
## $emmeans
## delivery.time = 59.6:
##  SES_category emmean     SE  df lower.CL upper.CL
##  Low            6.32 0.1020 292     6.12     6.52
##  Medium         7.82 0.1070 292     7.61     8.03
##  High           6.61 0.0969 292     6.42     6.80
## 
## Results are averaged over the levels of: new_range 
## Confidence level used: 0.95 
## 
## $contrasts
## delivery.time = 59.6:
##  contrast      estimate    SE  df t.ratio p.value
##  Low - Medium    -1.498 0.155 292  -9.662  <.0001
##  Low - High      -0.291 0.138 292  -2.100  0.0915
##  Medium - High    1.208 0.148 292   8.167  <.0001
## 
## Results are averaged over the levels of: new_range 
## P value adjustment: tukey method for comparing a family of 3 estimates
# Manually create a dataframe for contrast results
contrasts_results <- data.frame(
  Contrast = c("Low - Medium", "Low - High", "Medium - High"),
  Estimate = c(-1.498, -0.291, 1.208),
  SE = c(0.155, 0.138, 0.148),
  df = c(292, 292, 292),
  t.ratio = c(-9.662, -2.100, 8.167),
  p.value = c("<.0001", "0.0915", "<.0001")
)
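Hand-typing estimates risks transcription errors; the same table can be pulled programmatically from the emmeans object. The sketch below is self-contained on simulated data (not the report's dataset); for the report's own object, the equivalent call would be `as.data.frame(summary(contrast_results)$contrasts)`:

```r
library(emmeans)

# Self-contained illustration on simulated data (not the report's dataset)
set.seed(2)
d <- data.frame(
  g = factor(rep(c("Low", "Medium", "High"), each = 30),
             levels = c("Low", "Medium", "High")),
  x = rnorm(90)
)
d$y <- 5 + c(0, 1.5, 0.3)[as.integer(d$g)] + 0.2 * d$x + rnorm(90)

m  <- lm(y ~ g + x, data = d)
cr <- emmeans(m, pairwise ~ g)

# Convert the Tukey-adjusted contrasts to a plain data frame for kable()
auto_contrasts <- as.data.frame(summary(cr)$contrasts)
```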

# Format the contrasts table
contrasts_table <- kable(
  contrasts_results, 
  format = "html", 
  digits = 3, 
  caption = "Pairwise Contrasts of SES Categories"
) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  footnote(
    general = "P value adjustment method: Tukey."
  )

# Print the contrast table
contrasts_table
Pairwise Contrasts of SES Categories

Contrast       Estimate     SE   df  t.ratio  p.value
Low - Medium     -1.498  0.155  292   -9.662   <.0001
Low - High       -0.291  0.138  292   -2.100   0.0915
Medium - High     1.208  0.148  292    8.167   <.0001

Note: P value adjustment method: Tukey.
# Contrast results show that Medium SES stores have significantly higher customer satisfaction than both Low and High SES stores (p < 0.0001), while Low and High SES stores do not differ significantly (p = 0.0915). Focus on maintaining satisfaction in Medium SES stores and improving it in Low and High SES stores.

Section 2

Customer Satisfaction Analysis Report

This report explores the factors influencing customer satisfaction in a furniture retail setting, focusing on delivery time, staff satisfaction, and socio-economic status (SES) of store locations. Using statistical models and visual analyses, the findings reveal that delivery time and staff satisfaction significantly impact customer satisfaction. Moreover, the influence of delivery time varies based on SES, highlighting the need for tailored operational strategies.

Data Overview

The dataset comprised 300 customer feedback records, featuring variables such as customer satisfaction (1-10 scale), delivery time (in days), staff satisfaction (1-10 scale), and the SES category of the store location (low, medium, high). After cleaning, no missing or duplicate records remained. Outliers were identified in delivery time but retained for analysis, so that extreme but genuine delivery delays remain represented in the results.

Key Findings

1. Delivery Time and Customer Satisfaction

The model shows that delivery time has a significant negative effect on customer satisfaction. The coefficient for delivery time is -0.018720 (t(292) = -3.79, p < 0.001), meaning that for each unit increase in delivery time, customer satisfaction decreases by 0.019 units. The confidence interval for this effect is [-0.028, -0.009], confirming that the relationship is statistically significant.
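To make this effect size concrete, a back-of-envelope calculation can translate the slope into an expected gain. The 10-day reduction below is an illustrative figure, not a target from the analysis:

```r
# Expected satisfaction gain from reducing delivery time by 10 units,
# using the slope and 95% CI reported above (illustrative arithmetic)
beta  <- -0.018720          # delivery-time slope from the model
ci    <- c(-0.028, -0.009)  # 95% CI for the slope
delta <- -10                # a 10-unit reduction in delivery time

gain    <- beta * delta     # expected gain: about 0.19 points
gain_ci <- rev(ci * delta)  # CI for the gain: about [0.09, 0.28]
```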

Recommendations: Focus on optimizing delivery times to enhance customer satisfaction, particularly by implementing efficient systems to reduce delays.

2. Staff Satisfaction and Customer Satisfaction

Staff satisfaction remains a strong positive predictor of customer satisfaction. The coefficient for staff.satisfaction is 0.359605 (t(292) = 4.50, p < 0.001), indicating that for each unit increase in staff satisfaction, customer satisfaction increases by 0.36 units. The confidence interval for this effect is [0.21, 0.51], confirming the strong, positive relationship.

Recommendations: Continue investing in staff satisfaction through training programs and incentives, as improving staff morale directly boosts customer satisfaction.

3. SES and Customer Satisfaction

SES (Linear) positively impacts customer satisfaction, with a coefficient of 0.205445 (t(292) = 2.10, p < 0.05), suggesting that higher SES levels are associated with higher customer satisfaction. The confidence interval for this effect is [0.01, 0.40], indicating a modest but significant positive effect.

Recommendations: Consider tailoring customer experiences and services to cater to the higher expectations in higher SES locations.

4. Impact of New Product Range

The introduction of the new product range shows no significant impact on customer satisfaction, with a t-value of 0.93 and a p-value of 0.35. The estimated increase in satisfaction is small, with a confidence interval of [-0.12, 0.33], indicating that this factor does not significantly influence customer satisfaction.

Recommendations: While the new range might not directly impact satisfaction, focus marketing efforts on raising awareness and complementing it with enhanced service quality.

5. Interaction Between Delivery Time and SES

The interaction between delivery time and SES (Linear) is significant, with a coefficient of -0.022954 (t(292) = -2.67, p < 0.01). This suggests that the negative impact of delivery time on customer satisfaction is more pronounced in higher SES areas. The confidence interval for this interaction is [-0.040, -0.006], further confirming the stronger sensitivity to delivery delays in higher SES locations.

Recommendations: Tailor delivery strategies to SES category:

  - High SES: Prioritize fast delivery and premium services, since these customers are the most sensitive to delays.
  - Medium/Low SES: Offer cost-effective, more flexible delivery options that match customers’ expectations.

Conclusion

The analysis underscores the importance of operational efficiency, staff satisfaction, and SES-specific strategies in enhancing customer satisfaction. Key action points include:

  1. Delivery Optimization: Prioritize reducing delivery times, especially in high SES areas, where delays most negatively affect satisfaction.
  2. Staff Development: Invest in employee satisfaction initiatives to sustain superior service levels.
  3. Tailored Strategies: Adjust operational focus based on SES demographics for greater impact.
  4. Outlier Monitoring: Investigate and resolve recurring causes of extreme delivery delays to ensure consistent service quality.

By implementing these recommendations, the organization can significantly enhance customer experiences, drive satisfaction, and build long-term loyalty.